-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add grok and dissect methods to runtime fields #68088
Conversation
Pinging @elastic/es-search (Team:Search) |
This adds a `grok` and a `dissect` method to runtime fields which returns a `Matcher` style object you can use to get the matched patterns. A fairly simple script to extract the "verb" from an apache log line with `grok` would look like this: ``` String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` And `dissect` would look like: ``` String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` We'll work later to get it down to a clean looking one liner, but for now, this'll do. The `grok` and `dissect` methods are special in that they only run at script compile time. You can't pass non-constants to them. They'll produce compile errors if you send in a bad pattern. This is nice because they can be expensive to "compile" and there are many other optimizations we can make when the patterns are available up front. Closes elastic#67825
/** | ||
* A simple pattern that we can implement against grok and dissect. | ||
*/ | ||
public interface SimplePattern { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of us is particularly happy with this name. Sure, right now it is like java's Pattern
class but with only a single method so, sure, it is "simple" now. But "Simple" just sounds like a bad class name. AbstractingPattern
? NamedMatchesPattern
? GrokOrDissectPattern
? FlavoredPattern
? These sound kind of bad to me too. But maybe better than SimplePattern
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anyone can come up with a better name for this thing then I'm all ears.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjernst Pinging for a name :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could use something other than Pattern, since this doesn't actually directly have anything to do with java regexes. In essence it is a functional interface to take a string, and return a map of named substrings. Maybe NamedGroupExtractor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NamedGroupExtractor
will do just fine. We can always change it later because folks won't compile against it. I hope.
@@ -824,6 +831,48 @@ public void visitInvokeCallMember(InvokeCallMemberNode irInvokeCallMemberNode, C | |||
int j = i; | |||
irInvokeCallMemberNode.getArgumentNodes().get(i).visit(this, (e) -> irInvokeCallMemberNode.getArgumentNodes().set(j, e)); | |||
} | |||
PainlessMethod method = irInvokeCallMemberNode.getDecorationValue(IRDMethod.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the bit that runs the method at compile time if it has the annotation. We only implemented this for statically imported methods defined like from_class
or defined with instance binding because that is all we needed. We did not implement this for regular methods or augments or bound_to
declarations. All of those throw an error when you try to load the PainlessLookup
to warn folks that we never implemented this feature for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an issue that will allow annotations to determine if they're allowed on a certain whitelist type (#68322).
@@ -1220,7 +1221,13 @@ public void visitConstant(ConstantNode irConstantNode, WriteScope writeScope) { | |||
else if (constant instanceof Byte) methodWriter.push((byte)constant); | |||
else if (constant instanceof Boolean) methodWriter.push((boolean)constant); | |||
else { | |||
throw new IllegalStateException("unexpected constant [" + constant + "]"); | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit relies on the new compiler phase below.
|
||
import java.lang.reflect.Modifier; | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the secret magic of this PR: if you define a ConstantNode
you used to have to attach a constant that'd fit in the "constant pool". See the instanceof
tree below for what fits. This finds what doesn't fit and causes us to inject the values when we build the script. This is a trick we can do because painless is always embedded in a running process so we can just build what we need at compile time and stick it on the script objects we build. javac
has to write class files and it wouldn't have a place to "put" that state. We're lucky to be embedded, making this fairly easy for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but would it make sense to combine this into the constant folding phase rather than have a separate phase for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to run this "after" all of the constants have resolved, just in case some constant doesn't fit in the constant pool but we end up removing it during the constant folding phase.
@@ -131,7 +131,7 @@ public void testNoArgs() throws Exception { | |||
scriptEngine.compile("testNoArgs3", "_score", NoArgs.CONTEXT, emptyMap())); | |||
assertEquals("cannot resolve symbol [_score]", e.getMessage()); | |||
|
|||
String debug = Debugger.toString(NoArgs.class, "int i = 0", new CompilerSettings()); | |||
String debug = Debugger.toString(NoArgs.class, "int i = 0", new CompilerSettings(), Whitelist.BASE_WHITELISTS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method can take a List<Whitelist>
now so I can test the output using the fancy binding changes. To make sure the constants really get inlined.
@@ -15,6 +15,8 @@ tasks.withType(JavaCompile).configureEach { | |||
dependencies { | |||
compileOnly project(":server") | |||
compileOnly project(':modules:lang-painless:spi') | |||
api project(':libs:elasticsearch-grok') | |||
api project(':libs:elasticsearch-dissect') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I ever know the difference between api
and compileOnly
. I just copied these two lines from ingest-common
.
"runtime_fields.grok.watchdog.max_execution_time", | ||
TimeValue.timeValueSeconds(1), | ||
Setting.Property.NodeScope | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ingest these are node level settings so I made them node level settings. That might have been an accident of history. Runtime fields usually run inside of a search context so we could maybe do better than the watchdog that we have, but that seems like a problem for another time.
import java.util.Map; | ||
|
||
public abstract class BooleanFieldScript extends AbstractFieldScript { | ||
|
||
public static final ScriptContext<Factory> CONTEXT = newContext("boolean_script_field", Factory.class); | ||
|
||
static List<Whitelist> whitelist() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed so we can plumb grok
and dissect
into place one time.
This creates `.emit()` methods on a few primitive things for `keyword` and `date` style runtime fields. Now you can do stuff like: ``` "d": { "type": "date", "script": "'2020-01-18T17:41:34.000Z'.emit()" } ``` We get to piggy back of painless's dynamic dispatch code to regonize that you are in a `date` and trying to emit a `String` so we can "do the right thing" without any extra runtime cost. In this case we just parse the date using the `formatter` on the field. And since the example above doesn't use a formatter we get ISO8601, the date format of kings. Similarly, you can do stuff like: ``` "s": { "type": "keyword", "script": """ for (int i = 0; i < 100; i++) { i.emit(); } """ } ``` Painless *knows* `i` is an `int` and will call an emit method that emits its string value. Also! Assuming we get the syntax proposed in elastic#68088, because this is a chain of method invocations you can do something like: ``` "i": { "type": "long", "script": """ grok('%{NUMBER:i} %{NUMBER:j}').extract(doc['message'].value)?.i?.emit() """ } ``` This should be read as "if grok matches the message and extracts a value for `i` then emit it to the runtime field." If either the grok doesn't match or doesn't extract an `i` value then nothing will be emitted. As an extra nice thing - we'll automatilly convert whatever the `grok` expression outputs into a `long`. All of it handled for us using painless's standard dynamic dispatch code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for adding this! Very cool change. Added some thoughts for possible clean up/future PRs, but nothing that I would consider necessary for this PR.
|
||
import java.util.Map; | ||
|
||
public class CompileTimeOnlyAnnotationParser implements WhitelistAnnotationParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we haven't been great about this on previous annotations, but would you please add sentence or two on what this one does as a class-style JavaDoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+++
@@ -393,6 +395,10 @@ public void addPainlessConstructor(Class<?> targetClass, List<Class<?>> typePara | |||
"[[" + targetCanonicalClassName + "], " + typesToCanonicalTypeNames(typeParameters) + "]", iae); | |||
} | |||
|
|||
if (annotations.containsKey(CompileTimeOnlyAnnotation.class)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily for this PR - I wonder if it would make sense to allow annotations to know what they're applied to as part of the annotation parser so we could reject invalid types there where type is constructor, method, field, binding, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. As we add more of these this kind of one off stuff isn't going to be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an issue (#68322)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (int i = 0; i < irInvokeCallMemberNode.getArgumentNodes().size(); i++) { | ||
ExpressionNode argNode = irInvokeCallMemberNode.getArgumentNodes().get(i); | ||
if (argNode instanceof ConstantNode == false) { | ||
// TODO find a better string to output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget the TODO here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the text for the non-constant, and the name (I think it's available here as part of the node?) of the method we're trying to call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking with #68324 so we don't forget because we don't have any good ideas now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I filed the issue I'll do it in a follow up.
} | ||
} | ||
|
||
private void replaceCallWithConstant( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this utility method is worth it since it's only called once. I prefer for now that the visit methods all kind of be lumped together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's called twice in the method above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, never mind!
|
||
import java.lang.reflect.Modifier; | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but would it make sense to combine this into the constant folding phase rather than have a separate phase for it?
…8088) This adds a `grok` and a `dissect` method to runtime fields which returns a `Matcher` style object you can use to get the matched patterns. A fairly simple script to extract the "verb" from an apache log line with `grok` would look like this: ``` String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` And `dissect` would look like: ``` String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` We'll work later to get it down to a clean looking one liner, but for now, this'll do. The `grok` and `dissect` methods are special in that they only run at script compile time. You can't pass non-constants to them. They'll produce compile errors if you send in a bad pattern. This is nice because they can be expensive to "compile" and there are many other optimizations we can make when the patterns are available up front. Closes elastic#67825
…68332) This adds a `grok` and a `dissect` method to runtime fields which returns a `Matcher` style object you can use to get the matched patterns. A fairly simple script to extract the "verb" from an apache log line with `grok` would look like this: ``` String verb = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` And `dissect` would look like: ``` String verb = dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}').extract(doc["message"].value)?.verb; if (verb != null) { emit(verb); } ``` We'll work later to get it down to a clean looking one liner, but for now, this'll do. The `grok` and `dissect` methods are special in that they only run at script compile time. You can't pass non-constants to them. They'll produce compile errors if you send in a bad pattern. This is nice because they can be expensive to "compile" and there are many other optimizations we can make when the patterns are available up front. Closes #67825
Replaces the double `Pattern.compile` invocations in painless scripts with the fancy constant injection we added in elastic#68088. This caused one of the tests to fail. It turns out that we weren't fully iterating the IR tree during the constant folding phases. I started experimenting and added a ton of tests that failed. Then I fixed them by changing the IR tree walking code.
Replaces the double `Pattern.compile` invocations in painless scripts with the fancy constant injection we added in #68088. This caused one of the tests to fail. It turns out that we weren't fully iterating the IR tree during the constant folding phases. I started experimenting and added a ton of tests that failed. Then I fixed them by changing the IR tree walking code.
Replaces the double `Pattern.compile` invocations in painless scripts with the fancy constant injection we added in elastic#68088. This caused one of the tests to fail. It turns out that we weren't fully iterating the IR tree during the constant folding phases. I started experimenting and added a ton of tests that failed. Then I fixed them by changing the IR tree walking code.
Replaces the double `Pattern.compile` invocations in painless scripts with the fancy constant injection we added in #68088. This caused one of the tests to fail. It turns out that we weren't fully iterating the IR tree during the constant folding phases. I started experimenting and added a ton of tests that failed. Then I fixed them by changing the IR tree walking code.
As of elastic#68088 painless can have methods where all parameters must be constant. This improves the error message when the parameter isn't. It's still not super great, but its better and its what we can easilly give at that point in the compiler.
As of #68088 painless can have methods where all parameters must be constant. This improves the error message when the parameter isn't. It's still not super great, but its better and its what we can easilly give at that point in the compiler.
As of elastic#68088 painless can have methods where all parameters must be constant. This improves the error message when the parameter isn't. It's still not super great, but its better and its what we can easilly give at that point in the compiler.
As of #68088 painless can have methods where all parameters must be constant. This improves the error message when the parameter isn't. It's still not super great, but its better and its what we can easilly give at that point in the compiler.
This adds a
grok
and adissect
method to runtime fields whichreturns a
Matcher
style object you can use to get the matchedpatterns. A fairly simple script to extract the "verb" from an apache
log line with
grok
would look like this:And
dissect
would look like:We'll work later to get it down to a clean looking one liner, but for
now, this'll do.
The
grok
anddissect
methods are special in that they only run atscript compile time. You can't pass non-constants to them. They'll
produce compile errors if you send in a bad pattern. This is nice
because they can be expensive to "compile" and there are many other
optimizations we can make when the patterns are available up front.
Closes #67825